Data Analyst NanoDegree - Explore and Summarize Data

Exploring Prosper Loan Data

This project analyzes the loan data provided by Prosper (www.prosper.com) a peer to peer lending platform. The feature description about the dataset can be found at: A variable dictionary can be found here: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0.

I will perform three types of analyses, Univariate, Bivariate and Multivariate using preselected variables among the 81 that are part of the dataset and reflect on my findings in the concluding section.

Here is a summary of the dataset

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
## [1] 113937     81

Univariate Plots Section

I start the Univariate analysis by developing graphs of select variables of my interest. I have divided these variables into 4 broad categories along with the variables represented by them: * Borrower - BorrowerRate, Occupation, EmploymentStatus, TotalInquiries, CurrentDeliquencies, RevolvingCreditBalance, OnTimeProsperPayments, DebtToIncomeRatio, IncomeRange * Loan - Term, LoanStatus, ProsperPrincipalBorrowed, LoanOriginationDate, LoanMonthsSinceOrigination, MonthlyLoanPayment, PercentFunded, ListingCategory * Credit - CreditScoreRangeLower, CreditScoreRangeUpper, FirstRecordedCreditLine, TotalCreditLinespast7years, ProsperRating, ProsperScore * Lender - LenderYield, EstimatedReturn, EstimatedLoss

I will explore each variable to develop intuition about each.

Borrower

BorrowerRate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

I used a bin size of 0.5 in the histogram above. I shows a considerable peak at 0.35% rate with more than 5000 loans. The mean is around 0.2%

From the box plot we can see some relatively higher rates (max at 0.5%), although they seem reasonable.

Occupation

##  [1] Other                              Professional                      
##  [3] Skilled Labor                      Executive                         
##  [5] Sales - Retail                     Laborer                           
##  [7] Food Service                       Fireman                           
##  [9] Waiter/Waitress                    Construction                      
## [11] Computer Programmer                Sales - Commission                
## [13] Retail Management                  Engineer - Mechanical             
## [15] Military Enlisted                  Clerical                          
## [17]                                    Teacher                           
## [19] Clergy                             Accountant/CPA                    
## [21] Attorney                           Nurse (RN)                        
## [23] Analyst                            Nurse's Aide                      
## [25] Investor                           Realtor                           
## [27] Flight Attendant                   Nurse (LPN)                       
## [29] Military Officer                   Food Service Management           
## [31] Truck Driver                       Administrative Assistant          
## [33] Police Officer/Correction Officer  Social Worker                     
## [35] Tradesman - Mechanic               Medical Technician                
## [37] Professor                          Postal Service                    
## [39] Civil Service                      Pharmacist                        
## [41] Tradesman - Electrician            Scientist                         
## [43] Dentist                            Engineer - Electrical             
## [45] Architect                          Landscaping                       
## [47] Tradesman - Carpenter              Bus Driver                        
## [49] Tradesman - Plumber                Engineer - Chemical               
## [51] Doctor                             Chemist                           
## [53] Student - College Senior           Principal                         
## [55] Teacher's Aide                     Pilot - Private/Commercial        
## [57] Religious                          Homemaker                         
## [59] Student - College Graduate Student Student - Technical School        
## [61] Psychologist                       Biologist                         
## [63] Student - College Sophomore        Judge                             
## [65] Student - College Junior           Car Dealer                        
## [67] Student - Community College        Student - College Freshman        
## 68 Levels:  Accountant/CPA Administrative Assistant Analyst ... Waiter/Waitress

The most popular Occupation entered is ‘Other’ and the next popular is the ‘Professional’ category. We can group different occupation categories and revisit this chart. For example there are multiple categories beginning with Student. There are about 70+ types of occupations listed in the dataset. I will try to reduce that to around 7

With the new Occupation categorization, we see the most of the loans come from “General Service” and “Blue Collar Services” jobs, although the number of people belonging to these must be high as well and hence a large number of loans.

EmploymentStatus

##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

A large number of borrowers are employed, the number being 67322+26355 or ~93000 including both Full-Time and Employed categories. The EmploymentStatusDuration graph shows a steady decline in the number of loans holders as the employment duration increases. Median is 67 month whereas max is 755 months. From the box plot for EmploymentStatusDuration we don’t see any unusual outlier.

TotalInquiries

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   5.584   7.000 379.000    1159

TotalInquiries have an impact on credit and the summary and the graph show that although the graph peaks in the range of 0 to 6 enquiries, there are few who have inquired for more than 100 times and the max is 379 which is unusually high. I converted the scale on Y axis to log. The box plot shows a quite a few outliers having greater than 370-380 inquiries.

RevolvingCreditBalance

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3121    8549   17600   19520 1436000    7604

The log scaled histogram of RevolvingCreditBalance shows a large number of loans with low credit balance but there are many with greater than 50000 and even few with 0.5 million as credit balance. This is much clearly visible in the accompanying box plot, where there are multiple points beyong the 100,000.

OnTimeProsperPayments

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   15.00   22.27   32.00  141.00   91852

This feature shows the number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans and hence there are around 92000 entries with NULL. Although this would depend on Term of the loan, the count of on time loans seems to decline as the duration of ontime payment rises. The box plot doesn’t show any alarming outliers.

DebtToIncomeRatio

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The DebtToIncomeRatio is an important feature in determining the credit score of a loan as it indicates the probability of default as the debt-to-income ratio increases. In this histogram, the ratio is 0.2 to 0.3 for more than 30000, the peak. It then declines, but there are few loans which have the ratio as 10, an unusually high number. There are quite a few outliers shown in the box plot approaching the debt to income ratio of 10.

IncomeRange

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

The IncomeRange histogram looks like a normally distributed one wih a large number of loan takers reporting a salary in 25k-50k and 50k-75k ranges. The number of loans with lower and higher salary ranges are small for understandable reasons.

StatedMonthlyIncome

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

The StatedMonthlyIncome also shows an approximate normal distribution, with monthly salary ranging from 0 to 175k. The median is 4667. The accompanying box plot shows multiple outliers on both sides of the median.

BorrowerState

## 
##          AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA 
##  5515   200  1679   855  1901 14717  2210  1627   382   300  6720  5008 
##    HI    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI 
##   409   186   599  5921  2078  1062   983   954  2242  2821   101  3593 
##    MN    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY 
##  2318  2615   787   330  3084    52   674   551  3097   472  1090  6729 
##    OH    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT 
##  4197   971  1817  2972   435  1122   189  1737  6842   877  3278   207 
##    WA    WI    WV    WY 
##  3048  1842   391   150

The borrowers are in large numbers from CA at more than 14000, hence I had to log transform the Y axis to plot the state histogram.

Loan

Term

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00

The loan terms are 3 categories, 12, 36 and 60 months with 36 being the most preferred loan term.

MonthlyLoanPayment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0

Most monthly loan payments lie between 100 and 300, and then few are even above 1000. There are some with 0 payments which looks unusual. The box plot shows the distribution with some very low values, approaching 0 and very few high values in 1000s.

LoanOriginalAmount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The loans range from a minimum of 1000 to 35000, with median at 6500. 10000 and 15000 are popular loan amounts as well. The distribution shown in the box plot with outliers on the higher range seems fine.

LoanStatus

## List of 2
##  $ axis.text.x:List of 11
##   ..$ family       : NULL
##   ..$ face         : chr "bold"
##   ..$ colour       : NULL
##   ..$ size         : num 10
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : num 60
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.y:List of 11
##   ..$ family       : NULL
##   ..$ face         : chr "bold"
##   ..$ colour       : NULL
##   ..$ size         : num 10
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE
## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket. Most loans are current, with 56k loans; more than 38k loans are completed. About 12k loans are chargedoff, there are 5k defaults and about 2k loans are past due for multiple durations.

LoanMonthsSinceOrigination

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     6.0    21.0    31.9    65.0   100.0

LoanMonthsSinceOrigination: Number of months since the loan originated. A large number of loansare relateively new with less than one year in age. The box plot doesn’t show any outliers, as expected.

Credit

CreditScoreRangeLower and CreditScoreRangeUpper

I will combine these 2 into a single variable and plot a histogram

## [1] 0.424173
## [1] 0.06456199
##    Length     Class      Mode 
##    113937 character character

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

The graph shows that most loans go to borrowers with credit scores between 640 and 740. About 42% of loans have credit score between 640 and 740. About 6% of loans are to borrowers with credit below 600 or no credit rating at all. Except the one outlier nearing 0-10, there is nothing unusual about the distribution in the box plot.

BankCardUtilization

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604

BankCardUtilization measures the credit utilization when the credit was pulled, the median is 0.6. There are more than 7500 NAs, which means they were taking credit first time. The density plot shows peaks at 0 and 1. The box plot distribution shows a number of high BankcardUtilization scenarios as outliers.

ProsperScore

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    4.00    6.00    5.95    8.00   11.00   29084

A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. The graph shows an almost normal distribution and no outliers in the box plot.

CreditGrade and ProsperRating

##           A    AA     B     C     D     E    HR    NC 
## 84984  3315  3509  4389  5649  5153  3289  3508   141
##           A    AA     B     C     D     E    HR 
## 29084 14551  5372 15581 18345 14274  9795  6935

I have combined the 2 pre2009 and post2009 credit rating measures in the dataset to a new feature called ProsperCreditRating. After plotting a histogram of the same w can see that most of the loans belong to Ratings, B, C and D which is not unusual.

Univariate Analysis

What is the structure of your dataset?

A cursory look at all the variables shows that all the variables could be part of a model to determine a measure of risk involved for each loan entry. And based on individual data per loan one could create another model to determine the credit rating of a borrower for a particular loan. Here I assume that credit rating is a dynamic metric which is evaluated on a regular basis based on the repayment pattern during loan servicing perdied. At a macro level we could also determine the state of peer to peer lending business from the perspective of Prosper.

What is/are the main feature(s) of interest in your dataset?

When I use popular credit related apps such as CreditKarma and CreditWise I am shown some of the factors that determine my credit rating, these factors include - Oldest credit line, % of credit used, available credit across all credit lines, number of credit enquiries, number of accounts opened and closed, on time payments. This sounds much simpler than the number of variables at hand in the Prosper loan data. Hence I tried to categorize all the 81 variables in broad categories as per my own understanding of the credit industry. The categories and respective variables of my interest include: Borrower - BorrowerRate, Occupation, EmploymentStatus, TotalInquiries, CurrentDeliquencies, RevolvingCreditBalance, OnTimeProsperPayments, DebtToIncomeRatio, IncomeRange, BorrowerState
Loan - Term, LoanStatus, LoanMonthsSinceOrigination, MonthlyLoanPayment, ListingCategory Credit - CreditScoreRangeLower, CreditScoreRangeUpper, ProsperRating, ProsperScore

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Borrower - StatedMonthlyIncome, InvestmentFromFriendsCount, InvestmentFromFriendsAmount, BorrowerState, AmountDelinquent, InquiriesLast6Months, DelinquenciesLast7Years, TradesNeverDelinquent, BankcardUtilization, IsBorrowerHomeOwner Loan - PercentFunded, ProsperPrincipalBorrowed, LoanOriginationDate, ListingCategory Credit - FirstRecordedCreditLine, TotalCreditLinespast7years

Did you create any new variables from existing variables in the dataset?

Yes, following are the variables I derived from the dataset: * CreditScoreRange by combining the Lower and Upper as a string * MeanCreditScore by averaging the Lower and Upper as a number * Occupation grouped together to create smaller number of categories - Occupation_new * Combined ProsperRating..Alpha and CrediGrade to create ProsperCreditRating

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

For almost all variables I remove NA by default if any from the variable that I was going to plot. Apart from this I also did log and squre root transofrmations of disproportionate variables such as RevolvingCreditBalance, State and LoanStatus histograms. Occupation_new is a kind of redefinition of Occupation categorical variable which I used to group together similar categories to make analysis simpler. I also tried to set limits on x or y axes for better viewing of the data and remove outliers.

Bivariate Plots Section

I will be exploring the relationship among pairs of variables in this section. Beginning with a correlation matrix chart to get an idea in general about the relationship among variables I got me interested in the Univariate analysis section.

The generated graph is extremely busy but I am able to identify some of the variables which don’t show much correlation with others. Variables such as Term, EmploymentStatusDuration, ProsperScore, LoanCurrentDaysDelinquent, DelinquenciesLast7Years, ProsperPrincipalOutstanding are not showing correlation with others and hence I removed them in the following correlation matrix to reduce the density of information in the matrix.

There are some interesting observations in the graph above. First of all we can ignore the correlation of 1 among MeanCreditScore and the 2 CreditScoreRange [Upper/Lower] variables, since the mean is derived from the 2. The 4 variables, OpenRevolvingMonthlyPayment, OpenRevolvingAccounts, OpenCreditLines and TotalCreditLinesPast7Years are all strongly related to each other ranging from 0.5 to 0.82 Next highest correlation is between RevolvingCreditBalance and OpenRevolvingMonthlyPayment at 0.76, indicating that the payment tends to be usually equal to the Balance amount although it may not be always the case. The correlation between TotalInquiries and InquiriesLast6Months is also strong at 0.74, which is an obviously strong relationship. TotalProsperLoans and OnTimeProsperPayments correlation is at 0.7 which indicates that the On time payment record is a strong factor in getting the borrowers more loans. On the negative correlation front, variables BorrowerRate, MeanCreditScore show a correlation of -0.46 while MeanCreditScore and BankCardUtilization show a correlation score of -0.41.

In my analysis further I will also include the categorical variables along with continuous ones.

MeanCreditScore vs LoanOriginalAmount

## 
##  Pearson's product-moment correlation
## 
## data:  pd$MeanCreditScore and pd$LoanOriginalAmount
## t = 122.07, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3357190 0.3460095
## sample estimates:
##       cor 
## 0.3408745

We can observe a general increasing trend of MeanCreditScore as the LoanOriginalAmount increases. The loan amount is square root transformed. Although the graph shows lot of variation in Credit Score for most of the loan amounts, except the very high ones. At high loan amounts the Credit Scores are found to be among the highest.

MeanCreditScore vs BorrowerRate

## 
##  Pearson's product-moment correlation
## 
## data:  pd$MeanCreditScore and pd$BorrowerRate
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

The correlation shows a negative relation of -0.46 as we had seen in the correlation matrix. The chart shows a negative relation, except when the interest rate is less than around 0.6% where it shows a sharp increase in the trend. Hence it can be deduced that a very small interest rate is good for the credit score to a threshold after which the credit score declines.

The facet wrap graph histogram shows an approximate normal distribution for loan counts in 650 to 750 range and for higher credit scores the distribution is right skewed due to lower rate charged for them.

MeanCreditScore vs DebtToIncomeRatio

## 
##  Pearson's product-moment correlation
## 
## data:  pd$MeanCreditScore and pd$DebtToIncomeRatio
## t = -4.2633, df = 104800, p-value = 2.016e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.019221406 -0.007114667
## sample estimates:
##         cor 
## -0.01316852

I see a very small negative correlation and no strong relationship between DebtToIncomeRatio and MeanCreditScore. The second curve also shows a very non-obvious relationship. I was expecting the credit score to fall with a rising debt to income ratio, but since there are many other factors considered to calculate the credit score, from the data we have, the DebtToIncome ratio doesn’t seem to be playing a role.

Changing the gears a bit, I compare the newly created Occupation_new categorical variable with the StatedMonthlyIncome to understand whether my intuition regarding categorization on multiple occupation categories in the dataset was correct or not.

StatedMonthlyIncome vs Occupation_new

It can be observed from the box plot, that monthly income is highest for the Distinguished Service, followed by STEM related services, Medical services respectively. As expected Students are the lowest earners. I will now try to pair the Occupation_new with MeanCreditScore to get a roughly similar relationship.

MeanCreditScore vs Occupation_new

Filtering out the outliers, the box plots fpr MeanCreditScore show a similar relationship as was with the StatedMonthlyIncome. Although since the range for credit score is much lower than the income the differences aren’t very clear looking at the 1st and 3rd quantiles of the box plots.

MeanCreditScore vs ProsperCreditRating

I charted two different types of graphs, a box plot and a facet wrap. In the box plot it can be seen that median credit score is highest for AA and then it goes down for A, B upto E. The median for HR is higher than E’s median due to outliers with very high MeanCreditScore.

The facet chart shows an approximate normal distribution for ProsperCreditRating categories for MeanCreditScore present in the range 650 to 750. At higher credit score we can see the distribution tending to be right skewed due to dwindling number with poor Prosper ratings at higher MeanCreditScore.

MonthlyLoanPayment vs MeanCreditScore

## 
##  Pearson's product-moment correlation
## 
## data:  pd$MeanCreditScore and pd$MonthlyLoanPayment
## t = 102.99, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2871995 0.2978465
## sample estimates:
##      cor 
## 0.292532

The correlation shows a value of around 0.3 which is also reflected in the 2 charts that follow.

MeanCreditScore vs Term

## 
##  Pearson's product-moment correlation
## 
## data:  pd$MeanCreditScore and pd$Term
## t = 42.852, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1205304 0.1319881
## sample estimates:
##       cor 
## 0.1262635

The correlation is very low at 0.12. The box plot shows that for loans of shorter term as well as longer term the MeanCreditScore is higher, with a greater inter-quartile range for 12 month term. The loan term of 36 months is much larger in number of loans and correlate with a larger range of credit score even though the interquartile range is smaller.

MeanCreditScore vs IsBorrowerHomeOwner

The box plot for home onwership clearly shows a higher score for those who own their homes vs those who rent. This may be due to the fact that since mostly home ownership comes with a big asset backed loan and servicing that certainly helps the credit score.

MeanCreditScore vs LoanStatus

The categorical variable LoanStatus shows an obvious relationship with MeanCreditScore. The credit score is high for Current and Completed loans. But for Charged-off and Defaulted loans are clearly lower. The Past Due categories are flat across the 6 past due categories. An interesting thing to note is that for FinalPaymentInProgress and Completed loans the credit scores are lower than Current loans.

MeanCreditScore vs IncomeRange

Regarding the income range plot, it can be seen that the clearly distinctive categories are Not displayed on the lower end and $100,000+ and $75,000 to $99,000 ranges on the higher end. Not displayed category would ceratinly be higher risk would demand higher borrower rates due to may be poor credit history and an unstable income. For the other ranges there seems to be a high overlap.

MeanCreditScore vs BankcardUtilization

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604
## 
##  Pearson's product-moment correlation
## 
## data:  pd$MeanCreditScore and pd$BankcardUtilization
## t = -144.58, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4103489 -0.4003027
## sample estimates:
##       cor 
## -0.405338

A slight negative trend is visible for the greater usage of the credit brings down the credit score.

MeanCreditScore vs OpenRevolvingMonthlyPayment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   114.0   271.0   398.3   525.0 14980.0
## 
##  Pearson's product-moment correlation
## 
## data:  pd$MeanCreditScore and pd$OpenRevolvingMonthlyPayment
## t = 47.252, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1332758 0.1446941
## sample estimates:
##       cor 
## 0.1389896

A slight positive trend is visible here, signifying that an increase in revolving monthly payment results in higher credit usage and hence lower credit score.

MeanCreditScore vs LoanMonthsSinceOrigination

An interesting trend can be seen in this plot that given that there are just 3 loan terms, 12, 36 and 60 months. And hence all the loans with LoanMonthsSinceOrigination greater than 60 show a steady decline in their credit scores. The score remains roughly around 700 till about 60 month mark. we can correlate this to the LoanStatus with Past Due status which have low credit scores.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I got a rough idea about relationships among continuous variables by using the correlation matrix plot. I then eliminated some of the variables based on the result of correlation matrix and compared relationships between pairs of variables of my interest. My primary variable of interest in this analysis was MeanCreditScore which I paired with LoanOriginalAmount, BorrowerRate, DebtToIncomeRatio, ProsperCreditRating, Occupation_new, MonthlyLoanPayments, Term, LoanStatus and BankCardUtilization. I used multiple plot types to analyze the variables such as line, box, point or scatter plots and facet grids.

Some relatively strong relationships of MeanCreditScore are with BankCardUtilization, LoanOriginalAmount, BorrowerRate and MonthlyLoanPayment which shows that taking loan helps the credit score and as long as the loan is serviced responsibly. Greater the loan amount and greater the monthly payments, better is the score. The BorrowerRate which is a measure of risk associated with the loan, is overall negatively related to MeanCreditScore. Although there is a threshold, upto which an increase in BorrowerRate increases the credit score.

On relationships with the categorical variables such as Occupation_new, ProsperCreditRating, LoanStatus, I found them to be as expected. On LoanStatus although I saw that once the loan is Completed or about to be completed the credit score tends to fall below those who’s status is current.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I tried to see the relationship between StatedMonthlyIncome and Occupation_new to understand if my inuition regarding the payscale of different broader occupation types was correct. More or less my intuition turned out to be correct although I can’t say for sure that it was 100% accurate.

What was the strongest relationship you found?

Relationship of BorrowerRate with MeanCreditScore showed the strongest relationship among variables of my interest.

Multivariate Plots Section

MeanCreditScore vs LoanStatus vs IsBorrowerHomeowner

The plot above shows the relationship among MeanCreditScore, LoanStatus and home ownership. It is clearly evident that in most of the cases the non-owners of home have a lower credit score, except when the loan status was “Cancelled”. And the highest scorers are the ones who have have completed their loans and are also home owners, although those with “Current” status are close followers.

MeanCreditScore, BorrowerRate, ProsperCreditRating

Since 36 month term loans are more in number we can see the range of BorrowerRates covered is greater. This plot shows an interesting finding that for higher credit ratings, which correspond to higher credit scores the plot with borrower rate is flat, where as for lower credit rated loans, if the borrower rate is higher the scores tend to be higher and vice versa. Hence if a lower credit rated loan is charged a higher rate then the credit scores would also increase. Now I’ll replace credit rating with Occupation_new and see the trend.

MeanCreditScore, BorrowerRate, Occupation_new

Most of the categories show a similar trend, except the Other (most loans in this Occupation category) which shows a dip in credit score till about 0.26 borrower rate and then a rise. For others as the borrower rate increases the credit score decreases. Student category is clearly lower compared to others which are grouped closely together in the plot.

In the 2nd plot I further divide the plots using income ranges, Students are dominant in 0-25k income range. Interestingly there is nothing plotted in 75-100k income range.

MeanCreditScore, BorrowerRate and IncomeRange

In the plot above it can be seen that the BorrowerRate is negatively correlated with the credit scores. As seen earlier for lower band of income ranges the credit score rises with the rise in borrower rate before reaching a threshold after which it drops. The rise and drop are sharper in case of “Not Displayed” category of income range who can represent more risky borrowers and are not much interested in revealing their income information.

MeanCreditScore, LoanOriginalAmount and ProsperCreditRating

In the couple of charts above I tried to triangulate between 3 variables. The first plot is a line plot of different credit ratings, plotting loan amount vs credit scores. The geom_smooth helps to understand the trends in the busy plot and the trends are obvious and clearly visible.

MeanCreditScore, MonthlyLoanPayment by IncomeRange

The plot above shows relation among income, monthly payyment and credit scores. The increase in monthly payments help increase the score, which is easier for high income earners. Although if a risky borrower with Not displayed income pays higher loan payments, there is a sharper rise in their credit scores.

MeanCreditScore, LoanOriginalAmount by EmploymentStatus

Full-time, Employed and Other categories seem to be getting higher loan amounts and are having higher credit scores. Although Self-employed and Not-available categories have limited access to loan amount beyond around 250k, their median credit scores tend to be slightly higher than other categories.

MeanCreditScore, StatedMonthlyIncome, EmploymentStatus, Term

In the plot above I have tried to see the relation between monthly income, credit score based on employment status and term of the loan. I limited the x and y axes to remove extreme outliers seen in the StatedMonthlyIncome. The credit score can be seen to rise with with income. Full-time and employed tend to have a higher score even for lower salaries due to stability of cash flow to service the loan. The curves are not relatively flat for other categories showing a sharper rise with the income. The variance in slope of different employment status is not as big in loan categories 12 and 60 months.

MeanCreditScore, LoanMonthsSinceOrigination, EmploymentStatus, Term

In the above plot I’m trying to marry 4 variables to slice the data on the 4 dimensions for deeper analysis. It can be seen that as was see in the LoanMonthsSinceOrigination vs MeanCreditScore plot in bivariate plot section, there is a decline in credit scores once the loans are past due. Further seeing by credit rating and income range we see relatively more random pattern of regression curves for $0 or unemployed categories of income ranges. The higher credit ratings maintain a higher score consistently even after drop in ratings in the past due status.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this section of the project I tried to bring together multiple variables of interest and see relations among 3-4 variables at a time and understand finer nuances of interplay among them. Since the MeanCreditScore is the main variable of my interest, it almost always occupied the y axis. I checked BorrowerRate, LoanOriginalAmount, and StatedMonthlyIncome against the MeanCreditScore and further added catagorical variables such as ProsperCreditRating, EmploymentStatus, IncomeRange, Term and Occupation_new. In most plots my previous understanding about relationships among the variables was reaffirmed.

Were there any interesting or surprising interactions between features?

Some of the interesting findings were as below: I observed sharp rise in credit score, when borrower rate was increased for poorly rated loans. A similar sharp rise is observed whtn monthly income is increase for poorly rated loans. The ‘Other’ category in Occupation_new showed a strange U pattern as the borrower rate was increased. I think this category should be broken down furhter to analyze the reason for this behavior. Not all employment status categories were granted loans of value greater than 250k

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The grid of graphs above depicts the variations of credit scores with different categorical variables that I explored in my analysis above. It can be seen that credit rating category on the loan is an important determinant when deciding the credit score of an individual. Similarly the home ownership as well creates a stark distinction in the credit scores. Although there are overlaps among all categories in the plots, when it comes to comparing medians the distinction is obvious and clearly visible. When looking at them invidually the variations in credit scores point to obvious finding when it comes to determining credit scores, except in case of Term where the 36 month loans have a lower median credit scores where as the other 2 terms have higher medians. This can be explained by the fact that a considerably large number of loans are 36 month term which covers a broader category of borrowers including relatively higher number of non performing loans and leading to lower scores of borrowers.

Plot Two

Description Two

In this grid of graphs I compare the borrower rate (BorrowerRate), which showed high correlation with credit score, and monthly loan payments (MonthlyLoanPayment) for different income ranges. BorrowerRate is negatively correlated with the credit scores. As seen earlier for lower band of income ranges the credit score rises with the rise in borrower rate before reaching a threshold after which it drops. The rise and drop are sharper in case of “Not Displayed” category of income range who can represent more risky borrowers and are not much interested in revealing their income information. For higher income ranges there is a steady decline with borrower rate. Although at higher rates the income range becomes immaterial.

In the 2nd plot in the grid, the increase in monthly payment helps the credit score and more so for the relatively more risky borrowers in “Not Displayed” income range category, which shows a sharper rise in credit scores. From the line plots it can also be seen that different income ranges are able to afford monthly payments proportional to their earning capacity and the curves stop plotting beyond a particular monthly payment, for e.g. the 10,000+ goes farthest while 25-50k income range goes only upto about 1200.

Plot Three

Description Three

In this plot I try to relate the credit score with loan ages. In the bivariate plot analyis I observed that after an age of 60 months, which is the maximum term for loan age the credit scores dropped sharply. In this plot I have further segmented the loans by credit rating and also for different income ranges. Interesting to note is the pattern of fall of credit scores for different credit ratings. The decline is sharper for B, C, D, E and HR in the order, specifically when they are part of 0, Not displayed or Not emploed income range categories. In the income ranges from $50,000 to over $100,000 we see a consistency in the curves of different credit ratings tightly grouped together and falling as the loans become ast due.


Reflection

This has been quite an overwhelming exercise of trying to find relationships among diferent features of a dataset using plots, given that there are 81 features to be analysed. The size of dataset which has more than 110,000 records is large enough to help me find patterns without feeling the need for more data to analyze which can be the case sometimes with smaller datasets. Although the high number of records can also throw some crazy outliers. I have tried to find factors affecting the credit scores of borrowers based on my past understaning of this subject and reconfirming those assumptions during analysis of this dataset. Although it will require much more reverse engineering process to know the recepie with all the ingredients and proportions of all the features required to determine the credit score.

To summarize my process in this analysis was to categorize the 81 variables into broader categories and then create a list of variables that I would use, analyze them individually and then in pairs or triplets to lead me to intresting insights. In the process I created more features, experimented with different plot types and documented my findings. There is still a lot more that can be done with the data and I think I have just touched the surface.

Even though I limited my number of variables to analyze to about 10, still the number of combinations that can be analyzed is enormous and this is where some technique like principal component analysis and feature selection and reduction with the help of machine learning algorithms can be of great help. Handpicking features for analysis would require a great amount of intuition about the data which can only come after working with the data for a considerable amount of time.

For further analysis I would try to work with loan origination dates to see the trends of multiple variables over a period of time. Also can be factored in, the economic events during the period these loans were being serviced. I have also ignored the investor side of the story, who or how many people are investing and how much. What kinds of profits or losses they are experiencing. As an overall measure we can also come up with a meature of how successful the Prosper platform has been, is the rising volumes of loans is a good enough measure for success or whether is certainly creating a win-win situation for all the stakeholders in the arena.